
    Real-Time Detection of Dust Devils from Pressure Readings

    A method for real-time detection of dust devils at a given location is based on identifying the abrupt, temporary decreases in atmospheric pressure that are characteristic of dust devils as they travel through that location. The method was conceived for use in a study of dust devils on the Martian surface, where bandwidth limitations encourage the transmission of only those blocks of data that are most likely to contain information about features of interest, such as dust devils. The method, which is a form of intelligent data compression, could readily be adapted for the same purpose in scientific investigation of dust devils on Earth. In this method, the readings of an atmospheric-pressure sensor are repeatedly digitized, recorded, and processed by an algorithm that looks for extreme deviations from a continually updated model of the current pressure environment. The central question in formulating the algorithm is how to model the current normal observations and what minimum deviation magnitude is sufficiently anomalous to indicate the presence of a dust devil. There is no single, simple answer to this question: any answer necessarily entails a compromise between false detections and misses. For the original Mars application, the answer was sought through analysis of sliding time windows of digitized pressure readings. Windows of 5-, 10-, and 15-minute durations were considered. The windows were advanced in increments of 30 seconds. Increments of other sizes can also be used, but computational cost increases as the increment decreases and analysis is performed more frequently. Pressure models were defined using a polynomial fit to the data within the windows. In a representative example, the model for a 10-minute window was defined by a third-degree polynomial fit to the readings, and dust devils were identified as negative deviations exceeding both 3 standard deviations (from the mean) and 0.05 mbar in magnitude. An algorithm embodying this detection scheme was found to yield a miss rate of just 8 percent and a false-detection rate of 57 percent when evaluated on historical pressure-sensor data collected by the Mars Pathfinder lander. Since dust devils occur infrequently over the course of a mission, prioritizing observations that contain successful detections could greatly conserve the bandwidth allocated to a given mission. This technique can be used on future Mars landers and rovers, such as Mars Phoenix and the Mars Science Laboratory.
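
    As a rough illustration, the sketch below implements the sliding-window detector in Python with NumPy, using the parameters from the representative example (a 10-minute window advanced in 30-second steps, a third-degree polynomial model, and joint 3-sigma / 0.05-mbar thresholds). The function name and signature are hypothetical; this is a minimal sketch, not the flight implementation.

        import numpy as np

        def detect_dust_devils(times, pressure, window_s=600, step_s=30,
                               degree=3, n_sigma=3.0, min_drop_mbar=0.05):
            """Flag candidate dust devils as large negative deviations from a
            polynomial model fit over a sliding window of pressure readings.

            times    : sample times in seconds
            pressure : pressure readings in mbar, same length as times
            """
            times, pressure = np.asarray(times), np.asarray(pressure)
            detections = []
            t = times[0]
            while t + window_s <= times[-1]:
                mask = (times >= t) & (times < t + window_s)
                tw, pw = times[mask], pressure[mask]
                if len(tw) > degree + 1:
                    # Model the ambient pressure with a low-degree polynomial fit.
                    coeffs = np.polyfit(tw, pw, degree)
                    residuals = pw - np.polyval(coeffs, tw)
                    sigma = residuals.std()
                    # A dust devil appears as an abrupt negative excursion that is
                    # both statistically extreme and physically meaningful.
                    hits = (residuals < -n_sigma * sigma) & (residuals < -min_drop_mbar)
                    detections.extend(tw[hits])
                t += step_s
            return sorted(set(detections))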

    Automated Classification to Improve the Efficiency of Weeding Library Collections

    Studies have shown that library weeding (the selective removal of unused, worn, outdated, or irrelevant items) benefits patrons and increases circulation rates. However, the time required to review the collection and make weeding decisions presents a formidable obstacle. In this study, we empirically evaluated methods for automatically classifying weeding candidates. A data set containing 80,346 items from a large-scale academic library weeding project conducted at Wesleyan University from 2011 to 2014 was used to train six machine learning classifiers to predict “Keep” or “Weed” for each candidate. We found statistically significant agreement (p = 0.001) between classifier predictions and librarian judgments for all classifier types. The naive Bayes and linear support vector machine classifiers had the highest recall (fraction of items weeded by librarians that were identified by the algorithm), while the k-nearest-neighbor classifier had the highest precision (fraction of recommended candidates that librarians had chosen to weed). The most relevant variables were found to be librarian and faculty votes for retention, item age, and the presence of copies in other libraries. Future weeding projects could use the same approach to train a model to quickly identify the candidates most likely to be withdrawn.
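
    A minimal sketch of this kind of evaluation, using three of the named classifier types from scikit-learn, is shown below. The feature layout and function name are hypothetical; the abstract does not specify the exact feature encoding or train/test protocol.

        from sklearn.model_selection import train_test_split
        from sklearn.naive_bayes import GaussianNB
        from sklearn.svm import LinearSVC
        from sklearn.neighbors import KNeighborsClassifier
        from sklearn.metrics import precision_score, recall_score

        def evaluate_weeding_classifiers(X, y):
            """X: one row per item, with features such as librarian/faculty
            retention votes, item age, and holdings in other libraries.
            y: 1 for "Weed", 0 for "Keep" (the librarian's judgment)."""
            X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                                      random_state=0)
            classifiers = {
                "naive Bayes": GaussianNB(),
                "linear SVM": LinearSVC(),
                "k-NN": KNeighborsClassifier(n_neighbors=5),
            }
            for name, clf in classifiers.items():
                clf.fit(X_tr, y_tr)
                pred = clf.predict(X_te)
                # Recall: fraction of librarian-weeded items the model recovers.
                # Precision: fraction of model recommendations librarians weeded.
                print(f"{name}: recall={recall_score(y_te, pred):.2f}, "
                      f"precision={precision_score(y_te, pred):.2f}")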

    Progressive Classification Using Support Vector Machines

    An algorithm for progressive classification of data, analogous to progressive rendering of images, makes it possible to compromise between speed and accuracy. This algorithm uses support vector machines (SVMs) to classify data. An SVM is a machine learning algorithm that builds a mathematical model of the desired classification concept by identifying the critical data points, called support vectors. Coarse approximations to the concept require only a few support vectors, while precise, highly accurate models require far more support vectors. Once the model has been constructed, the SVM can be applied to new observations. The cost of classifying a new observation is proportional to the number of support vectors in the model. When computational resources are limited, an SVM of the appropriate complexity can be produced. However, if the constraints are not known when the model is constructed, or if they can change over time, a method for adaptively responding to the current resource constraints is required. This capability is particularly relevant for spacecraft (or any other real-time systems) that perform onboard data analysis. The new algorithm enables the fast, interactive application of an SVM classifier to a new set of data. The classification process achieved by this algorithm is characterized as progressive because a coarse approximation to the true classification is generated rapidly and thereafter iteratively refined. The algorithm uses two SVMs: (1) a fast, approximate one and (2) a slow, highly accurate one. New data are initially classified by the fast SVM, producing a baseline approximate classification. For each classified data point, the algorithm calculates a confidence index that indicates the likelihood that it was classified correctly in the first pass. Next, the data points are sorted by their confidence indices and progressively reclassified by the slower, more accurate SVM, starting with the items most likely to be incorrectly classified. The user can halt this reclassification process at any point, thereby obtaining the best possible result for a given amount of computation time. Alternatively, the results can be displayed as they are generated, providing the user with real-time feedback about the current accuracy of classification.
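
    The sketch below illustrates the two-SVM progressive scheme for a binary task, using scikit-learn. The kernel choices stand in for "fast, approximate" versus "slow, highly accurate" models, and the distance from the decision boundary serves as the confidence index; both are assumptions, since the abstract does not specify them.

        import numpy as np
        from sklearn.svm import SVC

        def progressive_classify(X_train, y_train, X_new, budget):
            """Classify with a fast SVM, then progressively reclassify the
            least-confident items with a slower, more accurate SVM until
            the reclassification budget is exhausted. Binary labels assumed."""
            # Fast approximate model: linear kernel, typically few support vectors.
            fast = SVC(kernel="linear").fit(X_train, y_train)
            # Slow accurate model: RBF kernel, typically many more support
            # vectors and thus a higher per-item classification cost.
            slow = SVC(kernel="rbf").fit(X_train, y_train)

            labels = fast.predict(X_new)
            # Confidence index: distance from the fast model's decision boundary.
            confidence = np.abs(fast.decision_function(X_new))
            # Reclassify in order of increasing confidence (items most likely
            # to be wrong first), stopping when the budget runs out.
            for i in np.argsort(confidence)[:budget]:
                labels[i] = slow.predict(X_new[i:i + 1])[0]
            return labels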

    Active Learning with Irrelevant Examples

    An improved active learning method has been devised for training data classifiers. One example of a data classifier is the algorithm used by the United States Postal Service since the 1960s to recognize scans of handwritten digits for processing ZIP codes. Active learning algorithms enable rapid training with minimal investment of time on the part of human experts, who provide training examples consisting of correctly classified (labeled) input data. They function by identifying which examples would be most profitable for a human expert to label. The goal is to maximize classifier accuracy while minimizing the number of examples the expert must label. Although there are several well-established methods for active learning, they may not operate well when irrelevant examples are present in the data set. That is, they may select an item for labeling that the expert simply cannot assign to any of the valid classes. In the context of classifying handwritten digits, the irrelevant items may include stray marks, smudges, and mis-scans. Querying the expert about these items wastes time or, if the expert is forced to assign the item to one of the valid classes, produces erroneous labels. In contrast, the new algorithm provides a specific mechanism for avoiding queries about irrelevant items. This algorithm has two components: an active learner (which could be a conventional active learning algorithm) and a relevance classifier. The combination of these components yields a method, denoted Relevance Bias, that enables the active learner to avoid querying irrelevant data so as to increase its learning rate and efficiency when irrelevant items are present. The algorithm collects irrelevant data in a set of rejected examples, then trains the relevance classifier to distinguish between labeled (relevant) training examples and the rejected ones. The active learner combines its ranking of the items with the probability that they are relevant to yield a final decision about which item to present to the expert for labeling. Experiments on several data sets have demonstrated that the Relevance Bias approach significantly decreases the number of irrelevant items queried and also accelerates learning speed.
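
    A minimal sketch of the query-selection step follows. Logistic regression stands in for the relevance classifier, and multiplying the active learner's informativeness score by the relevance probability is one plausible reading of "combines its ranking of the items with the probability that they are relevant"; the function name and signature are hypothetical.

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        def relevance_bias_query(active_scores, X_labeled, X_rejected, X_pool):
            """Pick the next pool item to show the expert.

            active_scores : informativeness score per pool item (higher =
                            more useful to label), from any active learner
            X_labeled     : items the expert has labeled (treated as relevant)
            X_rejected    : items the expert rejected as irrelevant
            X_pool        : unlabeled candidate items
            """
            # Train the relevance classifier: labeled examples vs. rejected ones.
            X_rel = np.vstack([X_labeled, X_rejected])
            y_rel = np.concatenate([np.ones(len(X_labeled)),
                                    np.zeros(len(X_rejected))])
            rel = LogisticRegression().fit(X_rel, y_rel)
            # Probability that each pool item belongs to a valid class.
            p_relevant = rel.predict_proba(X_pool)[:, 1]
            # Bias the active learner's ranking toward relevant items.
            return int(np.argmax(active_scores * p_relevant))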

    Online classification for time-domain astronomy

    The advent of synoptic sky surveys has spurred the development of techniques for real-time classification of astronomical sources in order to ensure timely follow-up with appropriate instruments. Previous work has focused on algorithm selection or improved light-curve representations, naively converting light curves into structured feature sets without regard for their time span or phase. In this paper, we highlight the violation of a fundamental machine learning assumption that occurs when archival light curves with long observational time spans are used to train classifiers that are applied to light curves with fewer observations. We propose two solutions to deal with the mismatch in the time spans of training and test light curves. The first is the use of classifier committees in which each committee member is trained on light curves of a different observational time span; only the member whose training span matches the test light curve's time span is invoked for classification. The second solution uses hierarchical classifiers that can predict source types both individually and by sub-group, so that the user can trade off an earlier, more robust classification against classification granularity. We test both methods using light curves from the MACHO survey and demonstrate their usefulness in improving performance over similar methods that naively train on all available archival data.
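
    A simplified sketch of the committee idea appears below. The random forest is a stand-in classifier (the abstract does not name one), and binning archival curves by time span is a simplification of the paper's setup, in which archival curves are matched to the test curve's observational span before feature extraction; all names are hypothetical.

        import numpy as np
        from sklearn.ensemble import RandomForestClassifier

        def train_committee(curve_features, labels, span_days, bucket_edges):
            """Train one committee member per observational time-span bucket.

            curve_features : feature matrix of archival light curves
            labels         : source type per curve
            span_days      : observational time span of each curve, in days
            bucket_edges   : sorted upper edges of span bins, e.g. [50, 200, 1000]
            """
            committee, lower = {}, 0.0
            for edge in bucket_edges:
                # Each member sees only curves whose span falls in its bucket,
                # so training and test distributions match at prediction time.
                mask = (span_days > lower) & (span_days <= edge)
                member = RandomForestClassifier(n_estimators=100, random_state=0)
                committee[edge] = member.fit(curve_features[mask], labels[mask])
                lower = edge
            return committee

        def committee_predict(committee, x, test_span):
            # Invoke only the member whose training span matches the test curve.
            edge = min((e for e in committee if test_span <= e),
                       default=max(committee))
            return committee[edge].predict(x.reshape(1, -1))[0]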

    Salience Assignment for Multiple-Instance Data and Its Application to Crop Yield Prediction

    An algorithm was developed to generate crop yield predictions from orbital remote sensing observations by analyzing thousands of pixels per county and the associated historical crop yield data for those counties. The algorithm determines which pixels contain which crop. Since each known yield value is associated with thousands of individual pixels, this is a multiple-instance learning problem. Because individual crop growth is related to the resulting yield, this relationship has been leveraged to identify pixels that are individually related to corn, wheat, cotton, and soybean yield. Those that have the strongest relationship to a given crop's yield values are most likely to contain fields with that crop. Remote sensing time series data (a new observation every 8 days) was examined for each pixel, containing information about that pixel's growth curve, peak greenness, and other relevant features. An alternating-projection (AP) technique was used to first estimate the "salience" of each pixel with respect to the given target (crop yield), and then those estimates were used to build a regression model that relates input data (remote sensing observations) to the target. This is achieved by constructing an exemplar for each crop in each county that is a weighted average of all the pixels within the county; the pixels are weighted according to the salience values. The new regression model estimate then informs the next estimate of the salience values. By iterating between these two steps, the algorithm converges to a stable estimate of both the salience of each pixel and the regression model. The salience values indicate which pixels are most relevant to each crop under consideration.
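
    One plausible instantiation of the alternating loop is sketched below. The salience-update rule (weighting pixels by how well their individual predictions match the county yield) is an assumption for illustration, as is the choice of ridge regression; the abstract specifies only the alternation between salience estimation and regression fitting.

        import numpy as np
        from sklearn.linear_model import Ridge

        def ap_salience(county_pixels, county_yields, n_iters=20):
            """Alternate between estimating per-pixel salience and fitting a
            regression from weighted county exemplars to crop yield.

            county_pixels : list of (n_pixels_i, n_features) arrays, one per county
            county_yields : known yield per county
            """
            salience = [np.full(len(p), 1.0 / len(p)) for p in county_pixels]
            model = Ridge(alpha=1.0)
            for _ in range(n_iters):
                # Step 1: build one exemplar per county as a salience-weighted
                # average of its pixels, then refit the regression model.
                exemplars = np.array([s @ p for s, p in zip(salience, county_pixels)])
                model.fit(exemplars, county_yields)
                # Step 2: re-estimate salience. Pixels whose individual
                # predictions best match the county's yield get more weight.
                for i, pixels in enumerate(county_pixels):
                    errors = np.abs(model.predict(pixels) - county_yields[i])
                    weights = np.exp(-errors / (errors.mean() + 1e-12))
                    salience[i] = weights / weights.sum()
            return salience, model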

    Visualizing Image Content to Explain Novel Image Discovery

    The initial analysis of any large data set can be divided into two phases: (1) the identification of common trends or patterns and (2) the identification of anomalies or outliers that deviate from those trends. We focus on the goal of detecting observations with novel content, which can alert us to artifacts in the data set or, potentially, the discovery of previously unknown phenomena. To aid in interpreting and diagnosing the novel aspect of these selected observations, we recommend the use of novelty detection methods that generate explanations. In the context of large image data sets, these explanations should highlight what aspect of a given image is new (color, shape, texture, content) in a human-comprehensible form. We propose DEMUD-VIS, the first method for providing visual explanations of novel image content, which combines a convolutional neural network (CNN) that extracts image features, a novelty detection method based on reconstruction error, and an up-convolutional network that converts CNN feature representations back into image space. We demonstrate this approach on diverse images from ImageNet, freshwater streams, and the surface of Mars.
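
    The sketch below covers only the novelty-ranking core (reconstruction error against a low-rank basis of previously seen items, in the style of DEMUD), not the CNN feature extraction or the up-convolutional visualization. It recomputes the SVD at each step for clarity, whereas DEMUD uses an incremental SVD update; the function name is hypothetical.

        import numpy as np

        def rank_by_novelty(features, k=10):
            """Rank items by reconstruction error against a rank-k basis
            built from previously seen items.

            features : (n_items, n_dims) array, e.g. CNN activations
                       extracted from each image with a pretrained network
            """
            seen = [features[0]]          # seed the model with the first item
            order = [0]
            remaining = set(range(1, len(features)))
            while remaining:
                X = np.array(seen)
                mean = X.mean(axis=0)
                # Low-rank basis describing what has been seen so far.
                _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
                U = Vt[:min(k, len(Vt))].T
                idx = np.array(sorted(remaining))
                centered = features[idx] - mean
                resid = centered - (centered @ U) @ U.T
                # The item the current model reconstructs worst is most novel.
                pick = idx[np.argmax(np.linalg.norm(resid, axis=1))]
                order.append(int(pick))
                seen.append(features[pick])
                remaining.remove(pick)
            return order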

    Semi-Supervised Eigenbasis Novelty Detection

    Recent discoveries in high-time-resolution radio astronomy data have focused attention on a new class of events. Fast transients are rare pulses of radio frequency energy lasting from microseconds to seconds that might be produced by a variety of exotic astrophysical phenomena. For example, X-ray bursts, neutron stars, and active galactic nuclei are all possible sources of short-duration, transient radio signals. It is difficult to anticipate where such signals might appear, and they are most commonly discovered through analysis of high-time-resolution data that had been collected for other purposes. Transients are often faint and difficult to detect, so improved detection algorithms can directly benefit the science yield of all such commensal monitoring. A new detection algorithm learns a low-dimensional linear manifold for describing the normal data; high reconstruction error indicates a novel signal that does not match the patterns of normal data. One unsupervised portion of the manifold model adapts its representation in response to recent data. A second supervised portion consists of a basis trained in advance on labeled examples of radio-frequency interference (RFI), which prevents false positives due to these events. For a linear model, an orthonormalization operation is used to combine these bases prior to the anomaly detection decision. A novel aspect of the approach lies in combining basis vectors learned in an unsupervised, online fashion from the data stream with supervised basis vectors learned in advance from known examples of false alarms. The result is adaptive, data-driven detection that is also informed by existing domain knowledge about signals that may be statistically anomalous but are not interesting and should therefore be ignored. The method was evaluated using data from the Parkes Multibeam Survey. This data set was originally collected to search for pulsars, which are astronomical sources that emit radio pulses at regular periods. However, several non-pulsar anomalies have recently been discovered in this dataset, making it a compelling test case. By explicitly filtering known false alarm patterns, the approach yields significantly better performance than current transient detection methods.
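
    A minimal sketch of the combined-basis scoring step follows, assuming feature vectors rather than raw voltages and using QR factorization for the orthonormalization; the derivation of the supervised basis via SVD of RFI examples and all names are assumptions for illustration.

        import numpy as np

        def novelty_scores(signals, rfi_examples, online_basis, n_rfi=5):
            """Score signals by reconstruction error against a combined basis:
            supervised RFI components learned in advance plus unsupervised
            components adapted online to recent data.

            signals      : (n, d) candidate signal feature vectors
            rfi_examples : (m, d) labeled examples of known false-alarm (RFI) patterns
            online_basis : (k, d) basis rows learned online from the data stream
            """
            # Supervised basis: principal components of the known RFI examples.
            _, _, Vt = np.linalg.svd(rfi_examples - rfi_examples.mean(axis=0),
                                     full_matrices=False)
            rfi_basis = Vt[:n_rfi]
            # Orthonormalize the stacked bases so projections are well defined.
            Q, _ = np.linalg.qr(np.vstack([rfi_basis, online_basis]).T)
            # Signals explained by normal data OR known RFI reconstruct well
            # (low score); a high residual marks a genuinely novel event.
            resid = signals - (signals @ Q) @ Q.T
            return np.linalg.norm(resid, axis=1)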